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Abstract 

We introduce methods for efficiently computing the Gaussian likelihood for station¬ 
ary Markov random field models, when the data locations fall on a possibly incomplete 
regular grid. The calculations rely on the availability of the covariances, which we show 
can be computed to any user-specified accuracy with fast Fourier transform algorithms. 
Several methods are presented, covering models with and without additive error terms 
and situations where either conserving memory or reducing computation time are fa¬ 
vored, and some of the algorithms are easily parallelized. The examples presented 
highlight frequentist inference, but access to the likelihood allows for Bayesian infer¬ 
ence as well. We demonstrate our results in simulation and timing studies and with 
an application to gridded satellite data, where we use the likelihood both for parame¬ 
ter estimation and likelihood ratio model comparison. In the data analysis, stochastic 
partial differential equation approximations are outperformed by an independent block 
approximation. 


1 Introduction 


In a wide range of scientific and engineering applications, such as imaging, agricultural 
field trials, climate modeling, and remote sensing, spatial data are observed or reported 
on regular grids of locations, and the Gaussian process model is co mmonly used to 


model the data direct ly, or indirectly as a stage in a hierarchical model ()R,ue et al.l . 120091 : 
Baneriee et ah . 2014l l. Let Y[x) ^ M., x £ I? denote a real random variable indexed by 


the two-dimensional integer lattice. We say that Y{-) is a Gaussian process if, for any 
n € N and any set of locations xi^n = {xi, the vector Y = (Y (®i ),... ,Y {xn))' 


*jsguinne@ncsu.edu 

iipsen@ncsu.edu 


1 

















has a multivariate normal distribution. A common decomposition of a Gaussian process 
is 


Y (x) = fi/ 3 (x) + Z(x) + £(x), 


where npix) is a nonrandom function depending on mean parameter /3, Z{-) is a 
Gaussian process with mean E{Z{x)) = 0 and covariance function Gov(Z(a;), Z[y)) = 
Kg{x,y) depending on covariance parameter 9, and £{x) 1 . 1 . d. A^(0 ,it^). The usual 
terminology in this scenario is to call Z{-) a latent process, since it is not directly 
observed, and to say that the model contains a nugget effect, referring to the presence 
of the £{x) term, which is used to model microscale variation or measurement error. 

We define the mean vector /.t^ := E(Y) = {yp{xi), ... , y,p{xn))' ■, the random vector 
Z = [Z[xi),..., Z[xn))' and its covariance matrix Tig. Then the covariance matrix 
for "K is Sf) + cr^I, where I is an identity matrix of appropriate size. The parameters 
can be estimated using the loglikelihood function for [3,9, and from Y 

L{l3,9,a‘^) = -^log(27r) - ^ log det(S 0 + cr^I) - ^{Y - fX/sYiTg + a'^iy^iY - /x^). 


Memory and time constraints begin to prohibit storing and factoring Tg+a'^I when n is 
between 10^ and 10^. Since the inclusion of a non-zero mean vector does not introduce 
a substantial additional computational burden, we assume = 0 in our discussion of 
the methods, to simplify the formulas and focus our attention on estimating covariance 
and nugget parameters. 

There is significant interest in identifying flexible covariance functions Kg{x,y) for 
which the l ikeliho od function-or an approximation to it-can be comp uted efficiently 


( Suri et ah . 20121 ). Gaussian Markov random field (GMRF) models ( R.ue and Held . 


2005l i are of particular value because they provide a flexible class of models that in¬ 


duce sparsity in a feature that can be exploited by sparse matrix factorization 
algorithms. A GMRF on the infinite integer lattice is naturally specified by the 
conditional expectation and variance of each observation given every other observation 
on 7?. Formally, we write 


Z{x)\{Z{y):yeZ^\x}r^N 


- T 

yGWP‘\x 


G{x,y) 
9{x, x) 


Z{y), ll9{x,x) , 


( 1 ) 


where 9{-, •) is a nonrandom function that encodes the conditional specification of Z{-). 
We use the symbol 9 for this function and for the vector of covariance parameters since 
they both control the covariance function of Z{-). If for every x, 9[x,y) is zero for all 
but a finite number of locations y, then the random field is said to be Markov. GMRFs 
can also be viewed as approx imations to more general classes of Gaussian random fields 
(jRue and Tielmelandl . l2002li. GMRFs are also u sed as a stage in other spati al hierarchi¬ 


cal rn ultiresolution models (jNvchka et al.l . 120141 ) and non-Gaussian models (|R.ue et al 


2009l b The most common approaches to achieving efficient Bayesian inference with la. - 


tent GMRFs involve Markov chain Monte Carlo fMCMC) ( Knorr-Held arid Rue . 2002l l 
or an integrated nested Laplace approximation (INLA) ( Rue et al. . 20091 ) to make in¬ 
ferences about 9 and cr^. For large datasets, the usual MCMC methods involve repeated 
proposals of Z to integrate out the latent GMRF. 

INLA uses a clever rewriting of the likelihood for the GMRF parameters given 
Y to avoid the need to integrate over the distribution of Z. In principle, this same 
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approach could be used in MCMC as well. However, when using a finite sample of 
observations to make inferences about parameters controlling processes defined on in - 
finite lattices, edge effects are known to introduce nonneglible biases (|Guvonl . |1982). 
For GMRFs, the issue is that conditional distributions of observations on the bound¬ 
ary of the study region given only the observations on the interior cannot be easily 
expressed in terms of 9 and depend on the particular configuration of the observa¬ 
tions. This is also true for observations on the interior whose neighbors are missing. 
A common approach to mitigating edge effects is to expand the lattice and treat ob¬ 


servations on the expanded lattice as mis s ing (iPaciorekl . 120071 : iLindgren et al.l . 12011 


Stroud et ah . 2014 : Guinness and Fuentesl . 2014l l. This approach is generally approx¬ 


imate in nature, and the practitio ner must dec^e how much lattice expansion is re- 
quired to mitigate the edge e ffects. Stroud et all ( 2014l l describe an exact method, but 
Guinness and Fuentes ( 2014l l argue that the tradeoff for exactness is a large number of 


missing values that mu s t be i mputed, slowing the convergence of iterative algorithms. 
Besag and Kooperberg ( 19951 ) give an approximate method for mitigating edge effects 


based on modifying an appr oximation to the pr e cision matrix when the data form 
complete rectangular gr id. IPutta and Mondall ( 20141 1 use approximate likelihoods 


based on the h-likelihood ( Henderson et ah . 1959ll but do not address edge effects. 


We introduce methods for efficiently computing of the Gaussian likelihood for 9 and 
from Y when the random field stationary, so when our methods can be applied, 
there is no need to impute missing values or expand the lattice to a larger domain. 
We do not impose any unrealistic boundary conditions on the spatial modeHthe like¬ 
lihood is for the infinite lattice model observed on a finite lattice. The methods rely 
on the availability of the covariances, which we show can be computed efficiently and 
to very high precision with fast Fourier transform (FFT) algorithms, and we provide 
a theorem and a numerical study as support. The computational methods make use 
of circulant embedding techniques as well as sparse matrix algorithms. The computa¬ 
tions are efficient in both memory allocation and floating point operations, as long as 
the observations form a nearly complete grid. The boundary of the grid need not be 
rectangular or even convex, and thus the methods can be used in diverse applications, 
particularly those in which the boundary of the study region is determined by irregular 
geography. Indeed, we apply our methods to a set of aerosol optical thickness values 
observed over the Red Sea, which has an irregular coastline and several islands over 
which the data are not reported. In Section O we describe our new methods for com¬ 
puting the exact likelihood, as well as discuss some existing likelihood approximations. 
Section [3] contains simulation and timing results showing cases in which the existing 
likelihood approximations fail due to boundary effects, and demonstrates the computa¬ 
tional efficiency of our new methods. The analysis of the Red Sea data is in Section U 
where we highlight the role of the exact likelihood for both parameter estimation and 
model comparison, and where stochastic partial differential equation approximations 
are outperformed by much simpler independent blocks likelihood approximations. We 
conclude with a discussion in Section [5l 


2 Efficient Gaussian Likelihood Calculations 

In this section, we describe our theoretical results and methods for computing like¬ 
lihoods for stationary GMRF models. For any x G define Sg{x) to be the set 
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Figure 1: Illustration of the fully and partially neighbored observations with neighborhood 
set consisting of four adjacent grid points. Black circles indicate observation locations, and 
those with a gray interior are the fully neighbored locations. 


{y G J? : 6{x, y) 7 ^ 0}, which we refer to as the neighborhood set of x under 9. If the 
random field is Markov, then it is possible that all neighbors of an observation location 
are contained in the finite set of all observation locations. 


Definition: Let 1 < i < n. If SQ{xi) C a;i:n, we say that Xi is a fully neighbored 
location, and Z{xi) is a fully neighbored observation. If a location or observation is not 
fully neighbored, we say that it is partially neighbored. 

We define < n to be the number of partially neighbored observations among 
xi-n- The partially neighbored observations consist of those along the boundary of 
the study region and any observations on the interior of the study region that have at 
least one missing neighbor. Figured] contains an illustration. For example, if xi-n is a 
complete square grid of dimension (1000,1000), and each location’s neighborhood set 
is the four adjacent grid points, = 3996, while n = 10®. Practical application of 
our methods depends on the following lemma, so we state it here first to give context 
to the theoretical results that follow. 


Lemma 1. Let xi-n a finite set of observation locations for the infinite GMRF 
Z{-) specified by 6, let Qo = the inverse of the covariance matrix for Z, and let 
Xi and Xj be two observation locations. If either Xi or Xj is fully neighbored, then 
Qe[i,j] = 9{xi,Xj). 

Let P be a permutation matrix for which PZ = {Z[, Z'^' reorders the components 
of Z so that Z\ contains the partially neighbored observations and Z^ contains 
the n — mn fully neighbored observations. Then define the block matrices 




Til Ti2 
T21 T22 


PQeP' 


Qii Q\2 

Q 21 Q 22 ’ 


( 2 ) 


so that Til the covariance matrix for Zi, and S 22 is the covariance matrix for Z 2 . 
Since Q 12 , Q 21 and Q 22 all contain either a row or a column corresponding to a fully 
neighbored observation, they are all sparse and have entries given by Lemmadl On the 
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other hand Qn is not guaranteed to be sparse, and its entries cannot be exactly deter- 



When 9{x,y) can be expressed as r]{x — y) for every x and y, the random field is 
said to be stationary. In this case, the covariance function has a spectral representation 


Ke{x,y) = 


f exp{iuj'{x - y)) 

^ 2 ^]" T,hGZ^v{h)exp{iu'h) 




(3) 


where the spectral density r/(/i) exp(icj'/i)) ^ is required to be non¬ 

negative for all cj E [0,27r]^. Assume that all of the observation locations fall inside a 
rectangular grid of size n = (ni,n 2 ), and the axes have been defined so that n^ < n^ - 
To compute the covariances, we follow a suggestion by iBesag and Kooperbergi (jl995l l 
to numerically evaluate the integrals in ([3|). We use the sums 


Ke{x,y; J,n) = 




nin2 


) exp {iu'j{x-y)), 


(4) 


j&Fj 


where Uj = 27r/J(ji/?T'i, j 2 /f^ 2 ) are Fourier frequencies on a grid Fj of size (niJ, n 2 J). 
The following theorem gives a bound on the convergence rate of this method for cal¬ 
culating the covariances. 

Theorem 1. If /{uj) is the spectral density for a Markov random field on 1? and is 
bounded above, then for J > 2 and any p £ N, there exists a constant Cp < oo such 
that 

(j 

\Ke{x,y) - Ke{x,y] J,n)| < 

We provide a proof in Appendix |Bl In other words, the error decays faster than a 
polynomial in (ni J) of degree p for every p> 1, so the sums converge quickly with J, 
and the error is smaller for a given J with larger ni. For large observation grids and 
common GMRF models, the covariances can be computed to an absolute error of less 
than 10“^® with J as small as 2 or 3. We provide a numerical study in Appendix O 
FFT algorithms allow us to compute covariances at all of the required lags x — y 'm. 
0 ((nin 2 J^) log(nin 2 J^)) floating point operations and 0 (nin 2 J^) memory. 

The following proposition gives convenient expressions for the determinant and 
inverse of Eg that we use to efficiently evaluate the Gaussian loglikelihood in the case 
when the model does not contain a nugget. 

Proposition 1. //Eg and Qe are partitioned as in ([2]), then 
(a) det(E0) = det(Eii)/det(( 522 ) 


7 

— Eii^Ei2 


fy-l 

0 ■ 


I 

O' 

0 

I 


0 

Q 22 


— E2iE^j^’^ 

I 


Proposition [T] can be viewed as a matrix version of the decomposition of the likelihood 
p(zi,Z 2 ) = p{zi)p[z 2 \zi), since Z\ is normal with mean 0 and covariance matrix En, 
and .^ 21-^1 is normal with mean Yi 2 i^ii and covariance matrix which is the 
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Schur complement of Sn. The entries of Sn are obtained with an FFT, as in Theorem 
[H and the entries of Q 22 are known analytically. The various quantities required for 
evaluating the determinant and for multiplying ^ by Y can be easily computed after 
the Cholesky factorizations of Sn and Q 22 have been obtained. Neither factorization 
is computationally demanding because Sn, though dense, is only of size x m„,, and 
Q 22 ) though large, is sparse and can be factored with sparse Cholesky algorithms. Mul- 
tiplication of ^91 or S 10 by a vector is completed with circulant embedding techniques 
( Wood and Chan . 1994 ). 

Including a nugget in the model increases the computational demands because (SeT 
is typically dense, eve n if is sparse. R e cent work on sol v ing diagonally per¬ 
turbed linear systems inc l ude Bellavia et ah J 2011 Du et ah (l2015l i ; Erwav and Marcia 
( 2012 ! ): Soodhalter et ah ( 201^ Vecharvnski and Knvazev ( 201.41 ). Here, we exploit the 
fact that most of the inverse of the unperturbed matrix is known and sparse, and we 
provide methods for evaluating the determinant in addition to solving linear systems. 
The covariance matrix for Y can be decomposed as 


Se + = Se(/ + a^Qe) = + Qe^I, 


(5) 


and thus the following proposition holds. 

Proposition 2. 

(a) det(S 5 i + cr^I) = det(S 5 i) det(I + a’^Qe) 

(b) (Se + cT^I)-! = (/ + a 2 g 0 )-ig 0 . 


Equation ([5]) and Proposition [2] can be viewed as a matrix version of the decom¬ 
position of the likelihood p{y) = p{z)p{y\z)/p{z\y), since the covariance matrix for 
Z is Se, th e covariance matrix fo r Y\Z is cr^I, and the precision matrix for Z\Y is 
+ Qo (iLindgren et al.l . I 2 OIII ). Exploiting this factorization requires one to know 
the dense matrix Qu, but as a consequence of Theorem [H Qu can be computed as 

Qii = -|- Q 12 Q 22 Q 2 I: 


which requires inversion of En, nin solves with the large sparse matrix Q 22 ) follwed by 
a multiplication with Q 12 , which is also sparse. In practice, we have found that the 
solves are the most time-consuming step, but the solves can be completed in parallel. 

In the examples we consider, it is possible to compute and store the Cholesky 
factors of Qe and I -|- a^Qe if their rows and columns are appropriately reordered, so 
Proposition [2] can be used directly to compute the Gaussian likelihood in these cases. 
However, storing the Cholesky factors could be a computationally limiting constraint 
in some cases. The following proposition gives expressions for the determinant and 
inverse of -|- a^I that can be used to evaluate the Gaussian loglikelihood function 
without storing any dense matrices larger than size x nin- 

Proposition 3. Let A := I + cr'^Qe and B := A~^ he partitioned as in ([2|). Then 

(a) det(E 5 i -|- fj^J) = det(E 0 ) det(H 22 ) det(H{'^^) 

-a^A^^Q2i l\ 

and = I + a^Qu - o-'^( 3 i 2 ^^ 2 ^Q 2 i- 


(6) (E, + u"/)-i = 


I —(T^(5i2^22 
0 / 


Bn 0 
0 


“-22 _ 
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Since Bn is the same size as Sn and A 22 is as sparse as <5221 most of the calculations 
are analogous to those that appear in Proposition [H so we omit most of the details 
here, noting however that multiplications with Qq are fast since it is mostly sparse, 
and we first form the matrix which is the Schur complement of A 22 -, then factor 

it to compute its determinant and solve linear systems with it. Further, when forming 
B^^, we do not need to store the entire matrix A 22 Q 21 , which is dense and of size 
n X rrin- Rather, we complete the solves involving each column <521 [^j] individually, 
either in sequence or in parallel, only storing Qi 2 A 22 Q 2 i[-,j]- 


2.1 Kriging and Conditional Simulations 

The inverse results in Propositions [U [21 and [3] can also be used to perform Kriging of 
the data to unobserved locations, since the computationally limiting step in Kriging 
is solving a linear system involving the covariance matrix and the data. Let To be a 
vector of unobserved values that we wish to predict using the vector of observations Y 
. The joint distribution of the two vectors is 


Y 




Eg + a'^I 

So 

.^ 0 . 

_R0. 

1 

[ 

Soo + 


where fj, and /xq are the two mean vectors, Sq is the cross covariance matrix between 
Y and Yq, and Sqo + cr‘^I is the covariance matrix for the unobserved values. Then the 
conditional expectation of Yq given Y is 

E{Yo\Y) = fio + J:^{j:0 + a^I)-\Y-fi) 


The inverse results are used to solve the linear system with + a^I, and the forward 
multiplication with Sq can be computed with circulant embedding techniques. 

Conditional simulations of Yq given Y do not require much additional computa¬ 
tional effort. The conditional simulations consist of the conditional expectation (com¬ 
puted above), added to a simulation of a conditional residual with covariance matrix 
Sqo + <7^/ — -b The conditional residual can be formed with stan¬ 

dard methods that are not more computationally demanding than the original Kriging 
computations, as long as o ne can generate unconditional simulations. See for example 
Chiles and Delfinerl (j2012l . Section 7.3.1). We use circulant embedding for the uncon¬ 


ditional simulations. 


2.2 Existing Approximations to Qg 

We consider some existing likelihood approximations as competitors to our methods 
for computing the exact likelihood. The likelihood approximations replace Qg = ^ 

with an approximation Qg, with the various approximations differing in how Qg\i,j] 
is defined when both Xi and Xj are partially neighbored, that is, how Qn is defined. 
Thus each approximation can be viewed as a method for dealing with edge effects and 
missing interior values. We may prefer to use a likelihood approximation if it is faster 
to compute and provides similar parameter estimates to the exact maximum likelihood 
parameters. We describe three approximations here and evaluate their effectiveness in 
Section [3l 
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No Adjustment: The simplest approximation defines Qeihj] = S{xi,Xj) for every i and 
j. Rue and Heidi ( 2005l l prove that Qg is positive definite when dehned in this way. 


Precision Adjustment: When the conditional specification satisfies the diagonal domi¬ 
nance criterion 


9{xi,Xi) = X ^ \d{xi,Xj)\ 

Xj ^lP'\Xi 

with A € [0,1), we set Qe[i-,j] = 0{xi, Xj) for i ^ j and 

Qg[i,i]=\ ^ \e{xi,Xj)\ 

Xj e® 1:71 j 

when both Xj and Xj are partially neighbored. Then Qg is symmetric and diagonally 
dominant, and thus positive definite. 

Periodic (Toroidal) Adjustment: Let xi-n be a complete rectangular subset of the 
integer lattice of dimensions n = (ni,n 2 ), so that n = nin 2 . Then we set 

Qg[i,j] = 9{xi -Xj + ko n), 

ki = arg minfcg{_i o,i}|a;fi - x^j + kn(\ 

for i = 1,2^ where k = (ki, ^ 2 ), and k o n = {kiui, k 2 n 2 )- 


3 Simulation and Timing Experiments 


In this section we study the maximum likelihood and maximum approximate likelihood 
parameter estimates in simulation and timing experiments. For the simulations, we 
choose to consider the case of no nugget, so that measurement error does not obscure the 
importance of edge effects. This allows us to study how the approximate edge correction 
techniques perform compared to inference using the exact likelihood. Existing MCMC 
and INLA methods are designed for Bayesian inferene, and thus are not applicable here. 
We do, however, consider the performance of INLA in the data analysis in Section [H 
We consider spectral densities of the form 


{u) = + 4 - 


( 6 ) 


where 9 = (r, k, u) with t,k>0 and n = 0,1,2,.... Lindgren et al.l (j2nil ) showed that 
the spectral density in Q is explicitly linked to the spectral density of the Matern co- 
variance function with integer smoothness parameter n and inverse range parameter k. 
When = 0, we have 9(0) = t^(k^ + 4), 9(h) = —when h = (1, 0), (—1,0), (0,1), or 
(0,-1), and 0 otherwise. The n = 0 i nodel has been called a symmetric first order au¬ 
toregression ( Besag and Kooperberel . 199f)l ). although an alternative parameterization 


IS more common. 


The following simulation studies demonstrate that edge effects adversely affect pa¬ 
rameter estimation, especially when the spatial correlation is strong. We consider the 
cases of u = 0 and u = 1 with k G {1/5,1/10,1/20} and We conduct 100 simulations 















V = 0, K = 1/5 


v = 0, K= 1/10 


V = 0, K = 1/20 



v = 1,K= 1/5 



V=1,K=1/10 v=1,k=1/20 




Figure 2: Simulation results for maximum approximate likelihood estimates (black crosses) 
and maximum exact likelihood estimates (dark gray circles) for 100 simulated helds at each 
parameter combination. The thin solid lines indicate the sample means of the parameter 
estimates, with the sample mean of k taken on the log scale. The thick light gray lines 
indicate the true parameter values. 


on a grid of size (100,100) for each of the six parameter combinations. We compare 
the maximum approximate likelihood estimates to the maximum likelihood estimates, 
computed with our efficient methods. For the i/ = 0 case, we find that the precision 
adjustment provides the best approximation. For the 1 ^ = 1 case, we find that the pe¬ 
riodic adjustment provides the best approximation, although no adjustment performs 
similarly. The results are presented in Figure [2l In Appendix [Cl we provide a figure 
containing simulated fields with these six parameter combinations. The precision ad¬ 
justment introduces a small bias when = 0. When ly = 1, the periodic adjustment 
performs poorly in every case, with the bias of the estimates increasing as k decreases. 
The maximum likelihood estimates are nearly unbiased in every case. 

When there is no nugget in the model, the exact likelihoods not only produce 
more reliable parameter estimates, they often do not impose a significant additional 
computational burden over the calculation of the approximate likelihoods. In Table 
m we present the results of a timing study for the ly = 0 and ly = 1 cases using no 
nugget and “no adjustment” neighborhood structure for the approximate likelihoods, 
which is the sparsest approximation. We compute the exact loglikelihood both with 
and without a small nugget. We use complete square lattices of increasing size, and 
time the computations using the “clock” and “etime” functions in Matlab. In both the 
ly = 0 and the ly = 1 case with no nugget, the approximate likelihood is only 1.25 times 
faster than the exact likelihood for the largest grid. Adding a nugget increases the 
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u=0 u=l 


n 

Approximate 
a2 = 0 

Exact 
a2 = 0 

Exact 
ct 2 = 0.01 

Approximate 
a2 = 0 

Exact 
a2 = 0 

Exact 
ct 2 = 0.01 

100 ^ 

0.05 

0.06 

0.16 

0.06 

0.09 

0.89 

1502 

0.09 

0.13 

0.55 

0.22 

0.27 

2.38 

2002 

0.22 

0.33 

1.20 

0.55 

0.64 

6.90 

2502 

0.47 

0.56 

2.17 

1.25 

1.41 

15.06 

3002 

0.80 

0.95 

3.72 

1.95 

2.45 

22.72 


Table 1: Time in seconds for one evaluation of the approximate likelihood and the exact 
likelihood, with and without a nugget term. 

time to compute the exact likelihood, but we note that the largest computations are 
infeasible without our new methods, due to memory constraints on naive methods, and 
we have not parallelized the solves required for constructing Qii- All computations are 
completed using Matlab R2013a on an Intel Core 17-4770 CPU (3.40GHz) with 32 GB 
of RAM. 


4 Application to Satellite Data 

The NASA Aqua satellite carries a moderate resolution imaging spectrometer (MODIS) 
capable of acquiring radiance data at high spatial resolution. The data are processed 
in order to obtain derived measurements of physical quantities relevant to land, ocean, 
and atmospheric dynamics, at various timescales. We analyze a set of aerosol optical 
thickness (AOT) data from Autumn 2014 over the Red Sea. The data were downloaded 
from NASA’s OceanGolor project website. Aerosol optical thickness is a unitless quan¬ 
tity describing the radiance attenuation caused by aerosols in the atmosphere. In 
Figure [3l we plot the data, which consist of 21,921 observations forming an incomplete 
regular grid. Even when gridded, satellite data often contain many missing values due 
to lack of coverage, atmospheric disturbances, or the quantity not being defined in 
certain regions (e.g. sea surface temperature over land). 

Our methods for computing the likelihood are useful both for estimating parameters 
and for comparing models, allowing us to conduct model selection and to compare 
various methods for estimating parameters. We highlight both of these uses in the 
analysis of the AOT data. We model the data as 

Y{x) = 12 + Z{x) + £{x), 

where /i € M is unknown and nonrandom, Z{-) is a GMRF with spectral density 
as in ([6|), with v = 1, and unknown nonrandom parameters k,t > 0, and £{x) ~ 
i.i.d. iV(0,iT^). For this choice of = 1, and this set of observation locations, we have 
nin = 3,469. We estimate and by maximum likelihood, being careful to 

prohle out /r and r, which have closed form expressions for the values that maximize 
the likelihood given hxed values of the other parameters. We also estimate n, t, and 
K, hxing = 0, allowing for a comparison between models with and without a nugget 
effect. Adding a nugget increased the loglikelihood by 594 units, indicating that the 
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Figure 3: Aerosol optical thickness in the Red Sea in Autumn 2014. White pixels are missing 
values. 


nugget significantly improves the model. The parameter estimates and loglikelihoods 
for these two models are given in Table [2l 

Having access to the exact likelihood allows for likelihood ratio comparisons, a con¬ 
venient way to compare parameter estimates found using various approximate methods. 
We estimate parameters with two well-ki iown l ikeliho od approximations; the inde¬ 
pendent blocks likelihood approximation (jSteinl . l2014lh an d an approximation base d 


on an incomplete inverse Cholesky factorization dVecchial . IlflRSl : IStein et al.l . 120041 1 


W e also estimate p arame ters using the R-INL A software (www.r-inla.org), described 
B,ue et al. 1 20091 ) and Martins et al. ( 2013), using a stochastic partial differential 


m m,ue et al.l (jzuyaj ana iiviartins et ai.i iizui.ni. using a stocnastic partial ainerentiai 
equation (SPDE) approximation ( Linderen et al. . 201 il l. Since INLA is a Bayesian 


method, it does not return maximum likelihood estimates, but it is worthwhile to con¬ 
sider whether the INLA maximum posterior parameter estimates nearly maximize the 
likelihood, since INLA is designed specifically to estimate models with a latent GMRF. 

There are several decisions that must be made when using the approximate meth¬ 
ods. With the independent blocks method, we must choose the size and shape of the 
blocks. We partition the Red Sea according to the orientation in Figure 01 so that each 
block has roughly 1600 observations, allowing the covariance matrices for the individual 
blocks to be stored and factored using standard dense methods. With the incomplete 
Cholesky method, we choose smaller blocks, always using the adjacent block in the 
northwest direction as the conditioning set. In R-INLA, there are several decisions 
that must be made, including whether to force the nodes of the SPDE mesh to coin¬ 
cide with the observation locations, and how fine a mesh to use. And of course the 
priors must be chosen; we used the default priors. We did not enforce the mesh nodes 
to coincide with the grid. We tried values of 2, 4, and 6 for the “max.edge” option, 
with smaller values of max.edge giving finer meshes and better parameter estimates. 

The results of the model fitting are summarized in Table 2. The use of the exact 
likelihood to estimate the model delivers estimates in just 8.40 minutes, which is a 
major step forward for exact likelihood methods for a dataset of this size. The time 
required for INLA estimates depends on the resolution of the mesh, with finer meshes 
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Figure 4: Partitions used for the independent blocks approximation and incomplete inverse 
Cholesky approximation (Vecchia Approximation). 


method 


T 

K 

a 

Aloglik 

time (min 

exact with nugget 

0.1464 

51.25 

0.2095 

0.0048 

0 

8.40 

inc. inv. Choi. 

0.1468 

49.29 

0.2094 

0.0050 

-33.4 

5.23 

ind. blocks 

0.1485 

49.12 

0.2054 

0.0051 

-54.3 

4.15 

INLA, max.edge = 2 

0.1471 

58.19 

0.1715 

0.0061 

-158.9 

33.2 

exact no nugget 

0.1462 

33.63 

0.3648 

0 

-594.0 

0.41 

INLA, max.edge = 4 

0.1490 

79.02 

0.1126 

0.0077 

-795.4 

3.23 

INLA, max.edge = 6 

0.1497 

88.49 

0.0855 

0.0085 

-1344.1 

1.23 


Table 2: Results of analysis of Red Sea data. Loglikelihood differences are from maximum 
likelihood model. 

requiring more time. Estimating the parameters with the finest mesh took 33.2 min¬ 
utes, nearly four times longer than generating the maximum likelihood estimates. The 
coarser meshes were faster than maximum likelihood, but the quality of the parame¬ 
ter estimates found with INLA depends on the mesh, with coarser meshes producing 
worse parameter estimates. The finest mesh that ran faster than maximum likelihood 
gave estimates whose loglikelihood is 795 units below the maximum loglikelihood, and 
decreasing the resolution further gave a loglikelihood more than 1300 units below the 
maximum loglikelihood, so it appears that the decisions made in constructing the mesh 
are very important. Comparisons with INLA should be made with caution, since it is 
a Bayesian method, but it is interesting to note that, in terms of loglikelihoods, the 
finest mesh is outperformed by the much simpler independent blocks approximation, 
which required much less computational effort. The incomplete inverse Cholesky ap¬ 
proximation was better still and slightly slower than independent blocks. In INLA, 
we tried experimenting with different priors and did not see much dependence on the 
priors. 


5 Discussion 

The Gaussian Markov random field, especially when paired with a nugget effect, is a 
powerful tool for modeling spatial data. This paper outlines how the likelihood for a 
GMRF plus a nugget can be computed when the GMRF model is stationary and the 
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data locations form a nearly complete regular grid. Our new methods naturally handle 
edge effects, which we demonstrate can cause serious problems for estimating parame¬ 
ters if not treated carefully. The availability of the likelihood allows for both frequentist 
and Bayesian inference, and it allows for likelihood ratio comparisons among parameter 
estimates obtained using various approximate methods. Our methods for computing 
the likelihood require forming the covariance matrix for the partially neighbored ob¬ 
servations, and we show that the covariances can be computed efficiently and highly 
accurately with FFT algorithms. We make use of decompositions that exploit the fact 
that the inverse of the covariance matrix is mostly known and sparse. The likelihood 
calculations can be considered exact in the same sense that any calculation involv¬ 
ing Matern covariances is considered exact, since evaluation of the Matern covariance 
function involves series expansions for the modified Bessel function. 

The computational limits of our methods are determined by m„, the number of 
partially neighbored observations, since we must construct and manipulate dense ma¬ 
trices of size nin x rUn- For complete square grids, this allows the computation of the 
likelihoods with more than one million observations. However, our methods are not 
able to handle situations in which is much larger than 10 ^, which arises when the 
observation grid has many scattered missing values. In this situation, using an itera¬ 
tive estim ation method that relies on conditionally si r nulati ng the missing values given 


the data ( Stroud et al. . 2014 : Guinness and Fuentesl. 2014lf is appropriate. Then the 


methods described in this paper can be used to calculate the likelihood for the entire 
imputed dataset. 
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A Numerical Study 


We demonstrate that the stationary covariances for the random fields that we consider 
can be computed to high accuracy with small values of J = 2 or 3. To study the 
calculations, for each value of J G {1,2, 3,4, 5}, we compute the approximation in ([1|). 
and we record 


A 7 = max 


Kq{x, X + h; J,n) — Kg{x, x + h; J + l,n) 


where the maximum absolute difference is taken over h on a grid of size n = ( 100 , 100 ), 
for G (0,1}, and k G {1/5,1/10,1/20}, in order to represent a wide set of models. 
The results are plotted in Figure [5l We see that the calculations converge very rapidly, 
with A 3 less than 10 “^*^ for every parameter combination. 


B Proofs 

Theorem 1. If /(cj) is the spectral density for a Markov random field on W? and is 
bounded above, then for J > 2 and any p G N, there exists a constant Cp < 00 such 
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Figure 5: (+) represents n = 1/5, (o) represents k = 1/10, and (A) represents n = 1/20. 


that 


\Kg{x,y) - Ke{x,y; J)\ < 


Cr, 


(rei J)P 


Proof. Since the random field is Markov, l/f{u) can be written as a finite snm of 
complex exponentials and must be infinitely differentiable, and thus /(cj) is also in¬ 
finitely differentiable, since it is assumed to be bounded above. Since the process is 
stationary, we write h = x — y, and K{x,y) = K{h) to simplify the expressions. 
Guinness and Fnentes ( 20141 ) proved that 


K{h] J, n) = K{h + {Jn o k)), 

k&I? 

where nok = (rei/ci, 71 , 2^2 )• Thus 

Kg{h) - Ke{h-, J,n) = K{h + {Jn o k)). 

k^O 

Since f{uj) is p > 3 times continuously differentiable, there ex ists constant M p < 00 
such that < Mp||r||“^, where || • || is Euclidean distance ( Korner . 19891 . Lemma 

9.5). Then the error can be bounded by 


\Ke{h)-Kg{h- J,n)\ <Y,\K{h + {Jnok))\ 


00 

< 3Mp {{J - l)ni)-P + 4(2j + l)Mp{jJn,)-P 

00 

= 3Mp {{J - l)ni)-P + 4Mp{Jni)-P ^(2^ + l)(j) 

00 

< 3Mp{{J/2)ni)-P + 4Mp{Jni)-PYi^3 + W) 

{Jni)-P 


-p 




-p 


< 


3Mp (1/2)-P + 4Mp Yi‘^j + 
i=i 
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Setting Cp equal to the quantity in square brackets establishes the bound for p > 3. 
The p = 1 and p = 2 cases follow by setting Ci = C 3 and C 2 = C 3 , since (Jni)“^ < 
(Jni)-2 < □ 

We believe the following are known results but were unable to find proofs, so we 
prove them here for completeness. Lemma 2 is used in the proof of Lemma 1. 

Lemma 1. Let xi-n a finite set of observation locations for the infinite GMRF 
Z{-) specified by 6 , let Qq = the inverse of the covariance matrix for Z, and let 
Xi and Xj be two observation locations. If either Xi or Xj is fully neighbored, then 

Qeifij] = 0 {xi,Xj). 

Proof. Using the form of the density function of the multivariate normal, the density 
function of Z{xi) given Z-i {Z with the ith element removed) is 

p{z{xi)\z_i) oc exp - ]^z'Qgz^ 

ocexp ( - ]-z{xifQe[i,i] - z{xi)J2Qe['i,j]zixj)\ (7) 

where jz = {z{xi),..., z{xn)) is the argument for the joint density for Z, and z_j 
is z with the ith. element removed. Suppose, without loss of generality, that Xi is 
fully neighbored. Then, using the conditional specification in ([1]) and the form of the 
univariate normal distribution, the density function of Z{xi) given Z-i is 

p{z{xi)\z_i) oc exp ( - ^z{xife{xi,Xi) - z{xi)'^e{xi,Xj)z{xj)y (8) 

Comparing ([7|) and ([8]) proves that Qe[hj] = S{xi,Xj), since ([7]) and ([8|) must be equal 
for any z_j € The symmetry of Qg and 9 (Lemma [2|) implies that Qg\j,i] = 

Qe[i,j] = 9{xi,Xj) = 6 {xj,Xi), which establishes the stated result. □ 

Lemma 2. If 6 specifies a valid GMRF, 9{xi,Xj) = 9{xj,Xi) for all Xi,Xj G ifi . 

Proof. Let be a set of observation locations that includes Xi, Xj, Sg{xi), and 

Sg{xj). The conditional density of Z{xi) given is 

p{z{x.i)\z-i) oc exp ^ - ^z'Qgz 

ocexp ( - ]-z{xifQg[i,i] - z{xi)'^Qg[i,j]z{xj)y (9) 

Using the form of the univariate normal distribution and the conditional mean and 
variance of Z{xi) given its neigbors, the conditional density of Z{xi) given xi-n is 

p{z{xi)\z_i) oc exp ( - ]-z{xif‘e{xi,Xi) - z{xi)^e{xi,Xj)z{xj) \ . (10) 

Comparing ([9]) and (fT0]l and equating coefficients gives Qg[i,j] = 9{xi,Xj). Repeating 
the same steps for Xj gives Qg[j,i] = 9{xj,Xi). The symmetry of Qg implies that 
9{xi,xfij — 9{xj,xf^. Ell 
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C Example Plots of Simulated Data 

Figure [6] shows realizations from the six models considered in the simulation study 
from Section [3l 
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Figure 6: Examples of simulated data for the six parameter combinations. 
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